Introduction to Open Data Science - Course Project

About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

Feeling a little bewildered. Lots to take in. I had forgotten the syntax of R; it has been years since I last used it. The introduction through the book was thorough, but reading is not the same as hands-on experience when it comes to coding. At least the book will serve as a good reference in the future.

Chapter 2: Regression and model validation: a case study

This week we dug into regression analysis and model validation. Both are very useful and interesting, and, I have to say, somewhat foreign to me. Too long has passed since I last dabbled in statistics. Much reading was needed to grasp the concepts well enough. Overall I think this week gave me a good basic understanding of how to conduct regression analysis in RStudio.

date()
## [1] "Tue Dec 13 01:46:41 2022"

This week’s dataset comes from Kimmo Vehkalahti. It contains data he collected in a survey among university students in an introductory social sciences course between Dec. 2014 and Jan. 2015. The survey covered 183 students and includes 56 questionnaire variables, one compound variable (attitude), and three contextualizing variables (age, gender, points in the final exam), bringing the total to 60 columns.

The data was cleaned and made uniform as follows:

  • The survey measured three different approaches to learning, represented by answers in the appropriate columns. These were merged into three compound columns corresponding to the approaches: deep, surface (surf) and strategic (stra). The value of each new column is the mean of the answers pertinent to that category (the original answers were on a scale of 1 to 5).
  • The “attitude” compound column was scaled to the same range (1–5).
  • The now superfluous raw survey columns were removed, leaving 7 columns: gender, age, attitude, deep, stra, surf and points.
  • Rows with a value of zero in the “points” column (17 in total) were removed, leaving 166 rows. Since the aim was to measure the impact of studying attitudes and approaches to learning on exam success, rows without exam results were redundant.
  • As a minor curation step, the column names were normalized.
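The cleaning steps above could be reproduced with roughly the following dplyr code. This is only a sketch: `raw_survey` and the question-name vectors are illustrative placeholders, since the real question-to-category mapping comes from Kimmo Vehkalahti’s survey documentation.

```r
# Sketch of the wrangling steps described above (not the exact script).
# raw_survey and the question lists below are illustrative assumptions.
library(dplyr)

deep_questions <- c("D03", "D11", "D19", "D27")     # placeholder names
surf_questions <- c("SU02", "SU10", "SU18", "SU26") # placeholder names
stra_questions <- c("ST01", "ST09", "ST17", "ST25") # placeholder names

learning2014 <- raw_survey %>%
  mutate(
    # compound columns = mean of the answers in each category
    deep     = rowMeans(across(all_of(deep_questions))),
    surf     = rowMeans(across(all_of(surf_questions))),
    stra     = rowMeans(across(all_of(stra_questions))),
    # rescale the attitude compound back to the 1-5 range
    attitude = Attitude / 10
  ) %>%
  # keep only the 7 analysis columns
  select(gender, age, attitude, deep, stra, surf, points) %>%
  # drop the 17 rows without exam results
  filter(points > 0)
```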

In the following code the .csv file containing the cleaned data is read into the variable “learning2014”, and its dimensions (dim) and structure (str) are checked to make sure everything is as it should be.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
setwd("data")
learning2014 <- read_csv("learning2014.csv")
## Rows: 166 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gender
## dbl (6): age, attitude, deep, stra, surf, points
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(learning2014)
## [1] 166   7
str(learning2014)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender  : chr [1:166] "F" "M" "F" "M" ...
##  $ age     : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   gender = col_character(),
##   ..   age = col_double(),
##   ..   attitude = col_double(),
##   ..   deep = col_double(),
##   ..   stra = col_double(),
##   ..   surf = col_double(),
##   ..   points = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

The data and its seven main variables can be summarized as follows:

library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p

summary(learning2014)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

The data includes more female (110) and younger (median age 22 years) participants. Exam points correlate most strongly with students’ attitude, followed by the “strategic” approach to learning. Male participants show somewhat higher attitude scores. Age correlates best with the “strategic” approach. The three variables “deep”, “stra” and “attitude” show positive interrelation. We will choose these three as explanatory variables, with points as the target variable.
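The correlations read off the ggpairs plot can also be checked numerically. A minimal sketch (the non-numeric gender column is dropped first):

```r
# Correlation of each numeric variable with exam points.
# Column 1 (gender) is character, so it is excluded.
cor(learning2014[, -1])[, "points"]
```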

my_model <- lm(points ~ attitude + stra + deep, data = learning2014)
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + deep, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5239  -3.4276   0.5474   3.8220  11.5112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.3915     3.4077   3.343  0.00103 ** 
## attitude      3.5254     0.5683   6.203 4.44e-09 ***
## stra          0.9621     0.5367   1.793  0.07489 .  
## deep         -0.7492     0.7507  -0.998  0.31974    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared:  0.2097, Adjusted R-squared:  0.195 
## F-statistic: 14.33 on 3 and 162 DF,  p-value: 2.521e-08

This summary shows the significance of each explanatory variable in relation to the target variable. First it lists the residuals, the differences between the model’s predictions and the actual values. Of the coefficient statistics, the interlinked t value and p value (Pr(>|t|)) are the most informative. Looking at them, the variable “deep” does not appear to correlate with points (Pr(>|t|) = 0.31974) and is statistically non-significant. We will therefore leave it out and refit the model with two explanatory variables.

my_model2 <- lm(points ~ attitude + stra, data = learning2014)
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

A slightly better result. Attitude appears to be the single most significant factor explaining the exam result (p = 6.31e-09). Of the approaches to studying, “strategic” carries the most weight; its p-value of 0.089 verges on statistical significance but does not quite reach it. The adjusted R-squared value of 0.1951 tells us that about 19.5% of the variation in exam points can be explained by attitude and adherence to the strategic approach, with attitude accounting for almost all of this effect. The model’s p-value of 7.734e-09 is highly significant, and the F-statistic of 20.99 is considerably larger than the critical F on 2 and 163 degrees of freedom (~3.05). We can therefore reject the null hypothesis.
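The critical F value cited above can be obtained directly from the quantile function of the F distribution (using the conventional 5% significance level):

```r
# Critical value of the F distribution at alpha = 0.05
# with 2 and 163 degrees of freedom
qf(0.95, df1 = 2, df2 = 163)
```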

What remains is regression diagnostics, shown below in three plots.

par(mfrow = c(2,2))
plot(my_model2, c(1, 2, 5))

The Residuals vs Fitted plot confirms that the fitted model is appropriate: the residuals and the fitted values are uncorrelated. The Normal Q-Q plot shows no evidence of major departure from normality, suggesting the residuals follow a normal distribution and are not badly skewed. Residuals vs Leverage shows no data points outside Cook’s distance, so there are no overly influential individual observations. Everything checks out.
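The visual Residuals vs Leverage check can be backed up numerically. A common rule of thumb (a convention, not part of the course material) flags observations whose Cook’s distance exceeds 4/n:

```r
# Flag observations whose Cook's distance exceeds the 4/n rule of thumb
cd <- cooks.distance(my_model2)
which(cd > 4 / length(cd))
```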


Chapter 3: Logistic regression

This week was all about the title. The statistics were interesting and the tools powerful; they should be useful in the future. The only problem I continuously have with R is that every function seems to be like a “magic box”. I just put in some numbers, and it does its magic in secret, spewing out magnificent graphs and whatnot, but I always feel like I have no clue about, or maybe even control over, how it does what it does. Like driving with an autopilot: just enter the destination, and the car takes care of everything. Can I say that I drove there? Or even that I can drive, if all I do is input destinations? Each function works differently; this seems not so much about learning general R syntax as about learning which function does what, and what you need to feed it. Perhaps the feeling arises because earlier I have mostly done everything step by step myself, each graph and such. At times this feels almost like cheating, and on the other hand like giving away control.

date()
## [1] "Tue Dec 13 01:46:49 2022"

1. Create the chapter3.Rmd file (produced in the data wrangling exercise) and include it as a child of index.Rmd

Done.

Also, loading all the needed libraries here.

library(scales); library(dplyr); library(tidyverse); library(ggplot2); library(boot)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor

2. Read the data into R.

3. Choose 4 interesting variables in the data

For each of them, present your personal hypothesis about their relationship with alcohol consumption. (0-1 point)

The 4 variables I have chosen to analyse in relation to alcohol consumption are:

  1. Absences (variable name: absences; numeric, from 0 to 45)
  • hypothesis: correlates positively with high alcohol consumption
  2. Going out with friends (variable name: goout; numeric: from 1 - very low to 5 - very high)
  • hypothesis: correlates positively with high alcohol consumption
  3. Extra-curricular activities (variable name: activities; binary: yes or no)
  • hypothesis: correlates negatively with high alcohol consumption
  4. Weekly study time (variable name: studytime; numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
  • hypothesis: correlates negatively with high alcohol consumption

Going out a lot with friends is probably the most strongly correlated with high alcohol consumption. High alcohol consumption could also cause absences from classes. The last two, on the other hand, would limit one’s ability to partake in sociable drinking.

4. Numerically and graphically explore the distributions of your chosen variables and their relationships with alcohol consumption

Use for example cross-tabulations, bar plots and box plots. Comment on your findings and compare the results of your exploration to your previously stated hypotheses. (0-5 points)

This will be a “preliminary” probe, with brief overviews on each chosen variable.

4.1 Absences

Absences is a numeric count of absences and high_use is binary, so any correlation should be visible from a simple boxplot.

plot_absences <- ggplot(alc, aes(x = high_use, y = absences))
plot_absences + geom_boxplot() + ggtitle("1. Amount of absences vs. high use of alcohol")

The hypothesis seems to be correct, and the correlation clear.

4.2 Going out with friends

Going out was graded on a scale from 1 to 5, and high_use is binary. A barplot should reveal any differences between the high-use and low-use groups.

plot_goout <- ggplot(alc, aes(x = goout))
plot_goout + geom_bar() + facet_wrap("high_use", labeller = label_both) + ggtitle("2. going out with friends vs. high alcohol usage") 

As visible from the barplot, the high users do go out a lot more than the low users; their distribution leans heavily to the right. A simple check of the mean values and distribution percentages confirms the hypothesis.

alc %>% group_by(high_use) %>% summarise(count = n(), mean_goout = mean(goout))
## # A tibble: 2 × 3
##   high_use count mean_goout
##   <lgl>    <int>      <dbl>
## 1 FALSE      259       2.85
## 2 TRUE       111       3.73
alc %>% group_by(high_use, goout) %>% summarise(count = n()) %>% ungroup() %>% group_by(high_use) %>% mutate(percentage = percent(count/sum(count)))
## `summarise()` has grouped output by 'high_use'. You can override using the
## `.groups` argument.
## # A tibble: 10 × 4
## # Groups:   high_use [2]
##    high_use goout count percentage
##    <lgl>    <dbl> <int> <chr>     
##  1 FALSE        1    19 7.34%     
##  2 FALSE        2    82 31.66%    
##  3 FALSE        3    97 37.45%    
##  4 FALSE        4    40 15.44%    
##  5 FALSE        5    21 8.11%     
##  6 TRUE         1     3 2.7%      
##  7 TRUE         2    15 13.5%     
##  8 TRUE         3    23 20.7%     
##  9 TRUE         4    38 34.2%     
## 10 TRUE         5    32 28.8%

4.3 Extra-curricular activities

Extra-curricular activity is a binary value, as is high_use. A simple tabulation of the counts gives us an overview:

alc %>% group_by(high_use, activities) %>% summarise(count = n()) %>% ungroup() %>% group_by(high_use) %>% mutate(percentage = percent(count/sum(count)))
## `summarise()` has grouped output by 'high_use'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 4
## # Groups:   high_use [2]
##   high_use activities count percentage
##   <lgl>    <chr>      <int> <chr>     
## 1 FALSE    no           120 46.3%     
## 2 FALSE    yes          139 53.7%     
## 3 TRUE     no            59 53.2%     
## 4 TRUE     yes           52 46.8%

Perhaps surprisingly, there is no notable difference in extracurricular activities between the two groups; both essentially break even. Further inquiries regarding this variable would be futile.

4.4 Studying time

Weekly study time is a numeric field (values: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours). A barplot should give an overview of any correlation.

plot_studytime <- ggplot(alc, aes(x = studytime))
plot_studytime + geom_bar() + facet_wrap("high_use", labeller = label_both) + ggtitle("4. Weekly study time vs. high alcohol usage") 

alc %>% group_by(high_use, studytime) %>% summarise(count = n()) %>% ungroup() %>% group_by(high_use) %>% mutate(percentage = percent(count/sum(count)))
## `summarise()` has grouped output by 'high_use'. You can override using the
## `.groups` argument.
## # A tibble: 8 × 4
## # Groups:   high_use [2]
##   high_use studytime count percentage
##   <lgl>        <dbl> <int> <chr>     
## 1 FALSE            1    56 21.6%     
## 2 FALSE            2   128 49.4%     
## 3 FALSE            3    52 20.1%     
## 4 FALSE            4    23 8.9%      
## 5 TRUE             1    42 37.8%     
## 6 TRUE             2    57 51.4%     
## 7 TRUE             3     8 7.2%      
## 8 TRUE             4     4 3.6%

As can be seen, almost 90% of high alcohol users studied less than 5 hours per week, whereas low alcohol users are spread more evenly across the categories and devote much more time to their studies. The correlation is strong here.

5. Use logistic regression to statistically explore the relationship between your chosen variables and the binary high/low alcohol consumption variable as the target variable.

Present and interpret a summary of the fitted model. Present and interpret the coefficients of the model as odds ratios and provide confidence intervals for them. Interpret the results and compare them to your previously stated hypothesis. (0-5 points)

Okey dokey. First the code, then the analysis.

m <- glm(high_use ~ absences + goout  + activities + studytime, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ absences + goout + activities + studytime, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9196  -0.7732  -0.5083   0.8446   2.5676  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -2.34944    0.53871  -4.361 1.29e-05 ***
## absences       0.06961    0.02213   3.145  0.00166 ** 
## goout          0.73572    0.11959   6.152 7.64e-10 ***
## activitiesyes -0.31151    0.25661  -1.214  0.22476    
## studytime     -0.56049    0.16914  -3.314  0.00092 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 375.68  on 365  degrees of freedom
## AIC: 385.68
## 
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                       OR      2.5 %    97.5 %
## (Intercept)   0.09542238 0.03227113 0.2680871
## absences      1.07208689 1.02777425 1.1223836
## goout         2.08697580 1.66084887 2.6570839
## activitiesyes 0.73234019 0.44132598 1.2094464
## studytime     0.57093162 0.40522569 0.7880221
Statistical significance:

Absences, going out and study time are statistically significant (p-values 0.00166, 7.64e-10 and 0.00092 respectively). Activities, on the other hand, is statistically insignificant, as noted already earlier. The preliminary hypotheses about the direction of the correlations were also correct with regard to the three statistically significant variables.

The odds ratios and confidence intervals:

The OR and CI show that each additional absence is associated with roughly 7 percent higher odds of belonging to the high-use group, with the population value falling between 2.8 and 12.2 percent. Similarly, moving up one study-time category is associated with roughly 43 percent lower odds of being a high user, with population values between 21.2 and 59.5 percent lower. Moving up one category in the going-out variable, on the other hand, is associated with a whopping 109 percent increase in the odds of being a high alcohol user, with population values between 66 and 166 percent. The CI of activities includes 1, which means its effect, positive or negative, cannot be established in the population.
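The percentage figures above come directly from the odds ratios: an odds ratio of r corresponds to a (r − 1) × 100 percent change in the odds. As a small sketch, reusing the OR and CI objects computed above:

```r
# Convert odds ratios (and their CI bounds) to percent changes in odds
percent_change <- (cbind(OR, CI) - 1) * 100
round(percent_change, 1)
```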

6. Using the variables which, according to your logistic regression model, had a statistical relationship with high/low alcohol consumption, explore the predictive power of you model.

Provide a 2x2 cross tabulation of predictions versus the actual values and optionally display a graphic visualizing both the actual values and the predictions. Compute the total proportion of inaccurately classified individuals (= the training error) and comment on all the results. Compare the performance of the model with performance achieved by some simple guessing strategy. (0-3 points)

We’ll use the statistically significant variables found above: absences, goout and studytime.

m2 <- glm(high_use ~ absences + goout + studytime, data = alc, family = "binomial")

probabilities <- predict(m2, type = "response")

# add the predicted probabilities to 'alc2'
alc2 <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc2 <- mutate(alc2, prediction = probability > 0.5)

# tabulate the target variable versus the predictions
table(high_use = alc2$high_use, prediction = alc2$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   239   20
##    TRUE     63   48
g <- ggplot(alc2, aes(x = probability, y = high_use))

# define the geom as points and draw the plot
g + geom_point(aes(col = prediction))

# tabulate the target variable versus the predictions
table(high_use = alc2$high_use, prediction = alc2$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.64594595 0.05405405 0.70000000
##    TRUE  0.17027027 0.12972973 0.30000000
##    Sum   0.81621622 0.18378378 1.00000000
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc2$high_use, prob = alc2$probability)
## [1] 0.2243243

The loss function gives 22.4% falsely predicted individuals. The false positives and false negatives in the cross-tabulation give 17.0% + 5.4% = 22.4%, the same number. This is the total proportion of inaccurately classified individuals (= the training error).
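For comparison, the exercise also asks about a simple guessing strategy. The most naive one, always predicting that nobody is a high user, misclassifies exactly the share of actual high users (111/370 ≈ 30%), so the model’s 22.4% training error is a clear improvement. A sketch using the loss function defined above:

```r
# Simple guessing strategy: always predict FALSE (probability 0).
# The error then equals the proportion of actual high users.
loss_func(class = alc2$high_use, prob = 0)
```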

7. Bonus: Perform 10-fold cross-validation on your model.

Does your model have better test set performance (smaller prediction error using 10-fold cross-validation) compared to the model introduced in the Exercise Set (which had about 0.26 error). Could you find such a model? (0-2 points to compensate any loss of points from the above exercises)

cv <- cv.glm(data = alc2, cost = loss_func, glmfit = m2, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2351351

It seems this model has better test set performance (error ~0.235) than the model in the Exercise Set (~0.26).

8. Super-Bonus: NOT DONE


Chapter 4: Clustering and classification

Reflections: This week was horribly busy for me, and I also became ill. I barely had time to do the exercises and the assignments. I did skim through Kimmo’s book chapters, but I must say that the whole concept of how clustering works seems a bit …vague? Yes, it works, but it seems that every method involves either or both of 1) the researcher guessing where to cut the clusters, and 2) a random element such that the results can only be reproduced by fixing a random seed. That sounds rather strange and does not make one feel that the process can ever be firmly grasped. But I will try my best. As mentioned in the book, clustering is a basic concept of reasoning, and an absolutely necessary tool for data analysis.

date()
## [1] "Tue Dec 13 01:46:51 2022"

1. Create chapter4.rmd -file etc.

Done.

# I will load all the needed libraries and the dataset here for convenience

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(tidyverse)
library(corrplot)
## corrplot 0.92 loaded

2. Load data, explore and describe it

data("Boston")

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14
glimpse(Boston)
## Rows: 506
## Columns: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…

The Boston dataset is a classic dataset distributed with the MASS package in R. Its title, “Housing Values in Suburbs of Boston”, describes its origins well enough. The dataset was gathered for the 1978 article “Hedonic prices and the demand for clean air” (J. Environ. Economics and Management 5, 81–102) by D. Harrison and D.L. Rubinfeld. The 14 variables map factors that affected housing prices in the suburbs (called “towns” in the documentation; we’ll use the same term henceforth) around Boston, and include information on, for instance, the crime rate, the concentration of nitrogen oxides in the air, and access to Boston’s radial highways. The data has already been cleaned; it comprises 506 rows of observations and 14 columns of variables.

The variables are:

  • crim: per capita crime rate by town.
  • zn: proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus: proportion of non-retail business acres per town.
  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  • nox: nitrogen oxides concentration (parts per 10 million).
  • rm: average number of rooms per dwelling.
  • age: proportion of owner-occupied units built prior to 1940.
  • dis: weighted mean of distances to five Boston employment centres.
  • rad: index of accessibility to radial highways.
  • tax: full-value property-tax rate per $10,000.
  • ptratio: pupil-teacher ratio by town.
  • black: \(1000(Bk - 0.63)^2\) where \(Bk\) is the proportion of blacks by town.
  • lstat: lower status of the population (percent).
  • medv: median value of owner-occupied homes in $1000s.

3. Summary of variables and graphical overview

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
cor_matrix <- cor(Boston) %>% round(2)
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

Summary

Many variables have quite natural distributions, as with the airborne NOx concentration. Yet some of them are uneven. The crime variable, for instance, ranges from 0.006 to 88.976, with a median of 0.26 and a mean of 3.61; there is huge variation (and a lot of outliers) in crime between towns, with a few of them accounting for the majority of the cases. Non-retail business acreage, on the other hand, is divided more evenly, but wealth is clearly unevenly distributed (lstat and medv). The black variable also seems to have a lot of outliers (min: 0.32, 1st qu.: 375.38, mean: 356.67). Some variables seem interconnected, judging by their similar spreads (for instance dis, rad, zn).

Plot

As can be seen, some of the values mentioned above are interconnected. Lower status (lstat) and median home value (medv) are strongly negatively correlated. The number of rooms correlates positively with more expensive houses, as can be expected. Interestingly, the dis variable (distance to employment centres) reveals the societal structure and character of the economic centres: older housing and smaller lots on the zoned land; more (non-retail) businesses, which bring customers and employees driving in daily by car, and hence more nitrogen oxides in the air; and also more crime and more lower-status inhabitants. This dynamic works as a sort of heuristic key for reading the data. As another instance of the same phenomenon, the variable rad (“access to radial highways”) connects the data points pertinent to the same dynamic: highways connect major centres. The crime rate is strongly connected with this variable for this reason.

As an additional check, I will draw boxplots of some of the variables mentioned in the summary analysis. The number of outliers makes it clear that we are dealing with data that is not evenly spread and should be checked for clusters. As a preliminary hypothesis, the clusters would most likely form around the dynamic of populous, dense commercial towns vs. smaller towns with much less crime, pollution, commerce, etc.

par(mfrow = c(1, 5))
boxplot(Boston$crim, main = "crim", col = "purple", outcol = "red")
boxplot(Boston$zn, main = "zn", col = "blue", outcol = "red")
boxplot(Boston$lstat, main = "lstat", col = "yellow", outcol = "red")
boxplot(Boston$medv, main = "medv", col = "orange", outcol = "red")
boxplot(Boston$black, main = "black", col = "green", outcol = "red")

4. Standardizing, creating a categorical variable, and the “train” and “test” datasets

Next, we will standardize the dataset and print out summaries of the scaled data. Scaling transforms each variable so that its mean is 0 and its standard deviation is 1.

boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
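
As a quick sanity check on what scale() actually does (a toy sketch with made-up numbers, not part of the assignment data): it centres each column by its mean and divides by its standard deviation.

```r
# scale() standardizes a column as (x - mean(x)) / sd(x)
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
automatic <- as.vector(scale(x))
all.equal(manual, automatic)
## [1] TRUE
```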

Then, as mentioned in the chapter title, we will create a categorical crime variable to replace the old crim column with, and after that the training and testing datasets.

boston_scaled$crim <- as.numeric(boston_scaled$crim)
# cut the scaled crime rate into categories using the quartiles as breaks
bins <- quantile(boston_scaled$crim)
labels <- c("low", "med_low", "med_high", "high")
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = labels)
# replace the old crim column with the new categorical variable
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
summary(boston_scaled)
##        zn               indus              chas              nox         
##  Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723   Min.   :-1.4644  
##  1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723   1st Qu.:-0.9121  
##  Median :-0.48724   Median :-0.2109   Median :-0.2723   Median :-0.1441  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723   3rd Qu.: 0.5981  
##  Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648   Max.   : 2.7296  
##        rm               age               dis               rad         
##  Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658   Min.   :-0.9819  
##  1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049   1st Qu.:-0.6373  
##  Median :-0.1084   Median : 0.3171   Median :-0.2790   Median :-0.5225  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617   3rd Qu.: 1.6596  
##  Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566   Max.   : 1.6596  
##       tax             ptratio            black             lstat        
##  Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033   Min.   :-1.5296  
##  1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049   1st Qu.:-0.7986  
##  Median :-0.4642   Median : 0.2746   Median : 0.3808   Median :-0.1811  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332   3rd Qu.: 0.6024  
##  Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406   Max.   : 3.5453  
##       medv              crime    
##  Min.   :-1.9063   low     :127  
##  1st Qu.:-0.5989   med_low :126  
##  Median :-0.1449   med_high:126  
##  Mean   : 0.0000   high    :127  
##  3rd Qu.: 0.2683                 
##  Max.   : 2.9865
# randomly pick 80 % of the rows for the training set, the rest for testing
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind, ]
test <- boston_scaled[-ind, ]

5. LDA on the train set

Next we’ll fit a linear discriminant analysis (LDA) model on the train set. The categorical crime rate is the target variable, with all the other variables as predictors.

# lda() comes from the MASS package (attached earlier)
lda.fit <- lda(crime ~ ., data = train)

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

The variable rad has the highest correlation with the crime rate, and it is mainly through rad that we are able to cluster the towns with the highest crime rates. The second and third most important factors are nox and zn, which basically differentiate the towns by their population density.

6. Predicting with LDA on the test data

Next we’ll use the test set to predict the crime rate classes and cross-tabulate the predictions with the actual results. First we’ll save the correct classes as their own variable and remove them from the test set.

correct_classes <- test$crime
test <- dplyr::select(test, -crime)

lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       14      14        0    0
##   med_low    2      10        5    0
##   med_high   1      11       16    2
##   high       0       0        0   27

As can be seen, the model predicts most of the cases correctly, with more accuracy at the low and high ends of the spectrum.
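
To put a number on that impression, the overall accuracy is the share of the diagonal in the cross-tabulation. A small sketch rebuilding the table printed above by hand (a rerun of the random split would of course give somewhat different counts):

```r
# rebuild the printed confusion matrix (filled column by column)
conf <- matrix(c(14,  2,  1,  0,    # predicted low
                 14, 10, 11,  0,    # predicted med_low
                  0,  5, 16,  0,    # predicted med_high
                  0,  0,  2, 27),   # predicted high
               nrow = 4,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
sum(diag(conf)) / sum(conf)  # share of correct predictions
## [1] 0.6568627
```

So roughly two thirds of the test towns land in the right class; with the objects at hand the same number comes directly from `mean(correct_classes == lda.pred$class)`.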

7. K-means clustering

Following the assignment, we’ll first reload the dataset and standardize it to bring the variable means to zero (and standard deviations to one).

data(Boston)
boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

Next, we’ll compute the Euclidean distances between the observations and check out a summary of the values.

dist_eu <- dist(boston_scaled)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
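
As a reminder of what dist() computes (a toy check with made-up points): by default it returns the pairwise Euclidean distances between the rows, i.e. the square root of the summed squared differences.

```r
# the classic 3-4-5 right triangle: distance from (0, 0) to (3, 4) is 5
m <- rbind(c(0, 0), c(3, 4))
as.numeric(dist(m))
## [1] 5
```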

Next we’ll run a k-means algorithm on the dataset.

set.seed(13)

km <- kmeans(boston_scaled, centers = 4)

pairs(boston_scaled, col = km$cluster)

Four clusters seems like a lot. Let’s try to gauge the optimal number of clusters via the total within-cluster sum of squares (TWCSS). A sharp bend in the graph should indicate a good value for the cut.

set.seed(123)

k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.

The optimal number of clusters seems to be 2. Let’s rerun the k-means analysis with 2 clusters.

set.seed(13)

km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled, col = km$cluster)

Not very informative due to the small panel size. Let’s take a closer look at the variables we noted as correlated earlier: crim, zn, nox, dis, rad, lstat and medv.

pairs(boston_scaled[, c(1, 2, 5, 8, 9, 13, 14)], col = km$cluster)

Some of these might still be dropped, but in the main it seems that these two clusters capture meaningful and significant differences between two groups of observations in the data. The commercial and economic centres have smaller lots, poorer air quality, better access to radial highways, more poor people, and less expensive homes. Further analysis would be needed to elaborate more.

DONE

Bonus: NOT DONE

Just not enough time. Being ill with a small baby has its drawbacks.

Super-Bonus: FIRST PART DONE

Yet, this seems intriguing. Who doesn’t want more dimensions? Gotta try, I’ll give it “15 minutes”, and just list the instructions here as I go through them. Not for the points, but for the coolness factor of 3D thingies.

Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
Next, install and access the plotly package.
# install.packages("plotly")
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
Create a 3D plot (cool!) of the columns of the matrix product using the code below.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')

#WOW

Adjust the code: add argument color as a argument in the plot_ly() function. Set the color to be the crime classes of the train set.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=classes)

It is beautiful. Oh wait, I can move it! C O O L

Draw another 3D plot where the color is defined by the clusters of the k-means.
# plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=km$cluster)

Okey dokey. This produces an error, probably because km$cluster is not aligned with the data matrix we created… I should join the values from the cluster column to the train set, and then create the data matrix and the 3D plot again… but this time I think I will just go to bed and sleep.
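
For the record, a sketch of the fix I’d try after some sleep. My (unverified) assumption: km$cluster holds labels for all 506 rows of the scaled data, while matrix_product only has the 404 training rows, so the cluster labels need subsetting with the same ind used for the train/test split. A toy demonstration of the alignment logic, with stand-in objects:

```r
# stand-ins: 'full_labels' plays the role of km$cluster (one label per row
# of the full data), 'ind_demo' the role of the train-set row indices
set.seed(13)
full_labels <- sample(1:2, 506, replace = TRUE)
ind_demo <- sample(506, size = floor(506 * 0.8))
length(full_labels)             # 506: too long for the 404-row projection
## [1] 506
length(full_labels[ind_demo])   # 404: now one label per projected point
## [1] 404
```

If that assumption holds, `color = km$cluster[ind]` in the plot_ly call should line up row for row.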

Peace out.


Assignment 5 - Dimensionality reduction

date()
## [1] "Tue Dec 13 01:47:01 2022"

Some reflections: it has been a really tiring week, and I’m writing this in the middle of the night, so the analyses are a little livelier than usual - sorry about that. Not much sleep due to having a small baby, but I’m still glad I have managed to do this course. Reading through Kimmo’s work this week, I kinda felt that I actually understood the concepts. The same went for the assignments: reading through them, they seemed rather easy and clear, focused more on rehearsing and repeating what was already learned than the previous weeks’ assignments were. However, the schedule and sleep deprivation proved such that I was not able to do the exercises and course diary until tonight.

0. Import data & libraries

library(dplyr)
library(tidyverse)
library(ggplot2)
library(GGally)
library(corrplot)
library(FactoMineR)

getwd()
## [1] "C:/Users/lauri/Desktop/IODS7/IODS-project"
setwd("data")
human <- read.csv("human.csv", row.names = 1)

1. Data, summary and graph overview

This week’s data comes from the UN. It originally comes from two different reports, those of the Human Development Index (HDI) and the Gender Inequality Index (GII). These indices are measured through various variables (8 for the HDI and 12 for the GII), and each country and region is ranked according to its index. More information on the data and the variables here and here.

The data was first combined based on country name. At this point the data included 195 observations of 19 variables. The data was then modified by creating two combined columns: the ratio of females to males in secondary education, and the same ratio in the labor force (these were split by gender in the original data). So: labF / labM -> labFM, and mutatis mutandis the same for the secondary education data. The rows with missing data, and the rows which combined information at the continent level, were removed. The country names were moved from the “country” column to the row names. And finally, only the columns pertinent to this inquiry were kept, amounting to 8 variables:

  • Edu2.FM - ratio of females to males in secondary education
  • Labo.FM - ratio of females to males in labor force
  • Life.Exp - life expectancy
  • Edu.Exp - expected average education, in years in school
  • GNI - gross national income per capita
  • Mat.Mor - maternal mortality
  • Ado.Birth - adolescent birth rate
  • Parli.F - percentage of females in parliament

Let’s take a look at the variables.

summary(human)
##     Edu2.FM          Labo.FM          Life.Exp        Edu.Exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
cor(human) %>% corrplot

ggpairs(human)

A lot to unpack. First of all, some variables have really wild scales with massive differences. Mat.Mor varies from 1 to 1100, with median 49; GNI varies from 581 to 123124 with a median of 12040; Ado.Birth also varies from 0.6 to 204.8 with a median of 33.6. Some also seem skewed, with means differing from the median significantly. Huge differences between countries, but also huge differences between the scales of each variable (some have max values close to 1, some around 50-200, and GNI over 120000). According to Kimmo’s book p. 246, “when variables are on very different scales or have very different variances, a PCA of the data should be performed on the correlation matrix” - just to keep this in mind.
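
That advice maps directly onto prcomp(): setting `scale. = TRUE` runs the PCA on standardized variables, which is the same as working on the correlation matrix. A quick check on the built-in mtcars data (just an illustration; the human data gets the same treatment in section 3):

```r
# PCA with scale. = TRUE equals PCA on pre-standardized data
p1 <- prcomp(mtcars, scale. = TRUE)
p2 <- prcomp(scale(mtcars))
all.equal(p1$sdev, p2$sdev)
## [1] TRUE
```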

Looking at the correlation plots, it’s easy to see that the variables have massive correlations. Mat.Mor and Ado.Birth correlate strongly and negatively with GNI, Edu2.FM, Life.Exp and Edu.Exp. This division - the first two with almost identical correlations, and the latter four with almost identical correlation patterns, but inverse to the first two - characterizes the whole data. Rich against poor. The rich live longer and receive better education, while the poor countries are here characterized by maternal mortality and adolescent births, which reflect negatively on the other variables.

Women participating in the labor force (or is “work force” the correct term?) and women in parliament do not seem to tip the scales much.

2. PCA on the raw (non-standardized) human data

Rrrright. As mentioned, the scales are wildly different between variables. So, we already know where this is going, but let’s do it anyway.

pca_human <- prcomp(human)

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

The almighty GNI explains everything. Perhaps stupidly drawn plots like these explain modern economic policies? Be that as it may, this approach certainly does not work well for visualizing our data: one variable, GNI, is on a massively larger scale than the others and therefore dominates the output, explaining 99% of the variance only because its numbers are the biggest and baddest. Let’s try this again with more reason.

3. PCA on the standardized human data

So, the same again, but with standardized data and some labels.

human_std <- scale(human)

pca_human <- prcomp(human_std)

s <- summary(pca_human)

# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)

# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot(pca_human, cex = c(0.6, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

A much more sensible outcome. Now we can actually see the significant factors at work. As noted earlier, the 2-vs-4 dynamic (Mat.Mor and Ado.Birth against GNI, Edu2.FM, Life.Exp and Edu.Exp) is what separates the countries most, and it accounts for 53.6% of the variance. Interestingly, here we can also see the Parli.F and Labo.FM variables, which account for 16.2% of the variance.

4. Interpretation

Right. I kinda feel like I already gave most of my interpretation earlier, but let’s recap. The primary dynamic in the data is the division between poor and rich countries. Rich countries with high GNI also score better on life expectancy, expected years of education, and the ratio of women to men in secondary education (to the left in the chart).

Poorer countries fare worse on these factors, but the most visible variables were maternal mortality and adolescent births: the countries with high numbers in these variables are very distinctly set apart from the countries in the first group, and are found to the right of the origin.

Interestingly, the ratio of women in parliamentary roles or participating in the work force was not significantly correlated with this rich-poor dynamic, but formed a secondary axis within this division. It seems women in parliament and/or the workforce are not as important as education in relation to the other variables.

5. Oh, would you look at the clock! Put the kettle on, dear, and fetch the scones!

It’s T-time.

Sorry about that, don’t know what got into me. Back to R.

tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)

str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
view(tea)

# I'll remove the age-variable, as it's the only numerical one, and the following pivot_longer doesn't work with it 
tea <- select(tea, -age)

# In two parts
pivot_longer(tea[1:18], cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free") + 
  geom_bar() + 
  theme(axis.text.x = element_text(size = 6))

pivot_longer(tea[19:35], cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free") + 
  geom_bar() + 
  theme(axis.text.x = element_text(size = 6))

Plenty of variables. Tea is very interesting, I must say. As per the assignment, let’s select some variables and do the MCA. We’ll go with Tea, age_Q, escape.exoticism, exciting, relaxing and spirituality, because why the hell not. Should be weird. Maybe we’ll find out what tea people in different age groups drink to relax and to have spiritual, exotic or escapist experiences. For science!

keep_columns <- c("Tea", "age_Q", "escape.exoticism", "exciting", "relaxing", "spirituality")
t_time <- select(tea, keep_columns)
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
##   # Was:
##   data %>% select(keep_columns)
## 
##   # Now:
##   data %>% select(all_of(keep_columns))
## 
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
# check out the new data
str(t_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
summary(t_time)
##         Tea        age_Q                escape.exoticism        exciting  
##  black    : 74   +60  :38   escape-exoticism    :142     exciting   :116  
##  Earl Grey:193   15-24:92   Not.escape-exoticism:158     No.exciting:184  
##  green    : 33   25-34:69                                                 
##                  35-44:40                                                 
##                  45-59:61                                                 
##         relaxing             spirituality
##  No.relaxing:113   Not.spirituality:206  
##  relaxing   :187   spirituality    : 94  
##                                          
##                                          
## 
pivot_longer(t_time, cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free") + 
  geom_bar() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 6))

Next, the MCA.

mca <- MCA(t_time, graph = FALSE)

summary(mca)
## 
## Call:
## MCA(X = t_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.255   0.243   0.194   0.174   0.172   0.160   0.148
## % of var.             15.289  14.556  11.622  10.434  10.306   9.619   8.894
## Cumulative % of var.  15.289  29.845  41.466  51.900  62.206  71.824  80.718
##                        Dim.8   Dim.9  Dim.10
## Variance               0.130   0.109   0.082
## % of var.              7.829   6.518   4.935
## Cumulative % of var.  88.547  95.065 100.000
## 
## Individuals (the 10 first)
##                         Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                    |  0.851  0.948  0.330 | -0.466  0.299  0.099 | -0.040
## 2                    |  0.674  0.595  0.232 |  0.476  0.311  0.115 | -0.909
## 3                    |  0.134  0.023  0.015 | -0.198  0.054  0.033 | -0.237
## 4                    | -0.921  1.109  0.692 | -0.340  0.158  0.094 | -0.229
## 5                    | -0.413  0.223  0.114 | -0.075  0.008  0.004 | -0.778
## 6                    | -0.374  0.183  0.155 | -0.463  0.294  0.238 |  0.312
## 7                    |  0.096  0.012  0.006 | -0.592  0.482  0.218 |  0.183
## 8                    |  0.554  0.402  0.152 | -0.897  1.106  0.398 |  0.110
## 9                    |  0.096  0.012  0.006 | -0.592  0.482  0.218 |  0.183
## 10                   |  0.554  0.402  0.152 | -0.897  1.106  0.398 |  0.110
##                         ctr   cos2  
## 1                     0.003  0.001 |
## 2                     1.423  0.421 |
## 3                     0.097  0.048 |
## 4                     0.090  0.043 |
## 5                     1.042  0.403 |
## 6                     0.168  0.108 |
## 7                     0.058  0.021 |
## 8                     0.021  0.006 |
## 9                     0.058  0.021 |
## 10                    0.021  0.006 |
## 
## Categories (the 10 first)
##                          Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black                |   0.888  12.719   0.258   8.785 |  -0.731   9.068
## Earl Grey            |  -0.501  10.567   0.453 -11.638 |   0.170   1.270
## green                |   0.940   6.353   0.109   5.713 |   0.649   3.181
## +60                  |   1.053   9.188   0.161   6.935 |  -0.593   3.055
## 15-24                |  -0.957  18.365   0.405 -11.004 |  -0.354   2.636
## 25-34                |  -0.088   0.116   0.002  -0.830 |   0.845  11.288
## 35-44                |   0.466   1.896   0.033   3.163 |  -0.736   4.961
## 45-59                |   0.581   4.484   0.086   5.072 |   0.429   2.572
## escape-exoticism     |  -0.248   1.897   0.055  -4.057 |   0.083   0.225
## Not.escape-exoticism |   0.222   1.705   0.055   4.057 |  -0.075   0.202
##                         cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                  0.175  -7.238 |  -0.266   1.499   0.023  -2.630 |
## Earl Grey              0.052   3.937 |  -0.072   0.291   0.009  -1.683 |
## green                  0.052   3.944 |   1.020   9.847   0.129   6.200 |
## +60                    0.051  -3.902 |   0.482   2.537   0.034   3.177 |
## 15-24                  0.055  -4.068 |   0.144   0.548   0.009   1.657 |
## 25-34                  0.213   7.988 |   0.811  13.029   0.197   7.668 |
## 35-44                  0.083  -4.991 |  -0.196   0.441   0.006  -1.330 |
## 45-59                  0.047   3.749 |  -1.307  29.887   0.436 -11.418 |
## escape-exoticism       0.006   1.364 |  -0.702  20.070   0.443 -11.507 |
## Not.escape-exoticism   0.006  -1.364 |   0.631  18.038   0.443  11.507 |
## 
## Categorical variables (eta2)
##                        Dim.1 Dim.2 Dim.3  
## Tea                  | 0.453 0.197 0.135 |
## age_Q                | 0.521 0.357 0.540 |
## escape.exoticism     | 0.055 0.006 0.443 |
## exciting             | 0.008 0.506 0.005 |
## relaxing             | 0.190 0.380 0.037 |
## spirituality         | 0.303 0.009 0.002 |
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")

As we can see, these chosen variables do not account for much of the variance in the data: only 15.29 + 14.56 = 29.85% on the first two dimensions. So all the following analyses should be taken with a pinch of salt. Or a handful. But it seems that there is a NO camp and a YES camp on the question of whether tea as a substance awakens feelings, such as spirituality, escapism, excitement and/or relaxation. The younger one is, the more prone one is to succumb to such frivolous emotions. The NO camp, proudly heralded by the 60+ class, considers tea not exciting, not relaxing, not spiritual and most definitely not in any way connected with escapist exoticism. The best tea to drink in the NO camp is black (as one’s soul), but some also go with green (as envy). Earl Grey, on the other hand, is for the babies in the YES camp.

On a more serious note: not much clustering. The oldest age groups contributed most to the analysis, followed by black and green teas (judging by the v-test values). But at the moment I think I'm just too tired to analyse anything further. I'll just have a nice cup of Earl Grey and commit these to the Hub.


Assignment 6 - Analysis of longitudinal data

date()
## [1] "Tue Dec 13 01:47:12 2022"

0.

Reflections: doing this on the last day. It has been exhausting; I've been sick with a tenacious flu, and the baby doesn't like to sleep. Reflecting more on the whole course than just this week. I had used R briefly before, but for very different purposes, in a very different way, and some six years ago at that. So everything was not new, yet new. I found out that I need to grasp statistical concepts more firmly; my understanding of them is on too shaky ground for me to feel comfortable doing analyses.

Overall a good course in the sense that it gave me glimpses of something else. Yet I cannot say that I would master these methods after this brief intro. So perhaps next I need to take a deeper look into statistical analyses at the ground level, and into R in connection with them. My approach seems to be inverse: starting from the more complicated, advancing to the simpler stuff. Still, I think the analyses I aim to do with R might benefit from this course; I just need some more time with the fundamentals. Perhaps a more hands-on, DIY approach would have been better: this course was an intro and, as such, entailed a lot of copying of ready-made code. That is of course what we all do, but to get something to stick to one's spine, for a workflow to become the norm, one has to really hack it deep into the ape brain. And for me the only way to achieve that is repetition. So perhaps the next step for me is doing some reps to get some gains on them old muscles; here in this course I had the PT giving me pointers on technique.

Onwards to the assignment. This seems very applied. But it is a lot of work, and I have only a few hours. I'll do what I can: first the graphs and such, and if I have time I will add the verbal analysis.

EDIT:

I just couldn't keep my eyes open any longer, so I did not finish the assignment properly. Most of the charts are done, but the analysis is lacking. The neat and tidy R Markdown formulations are also not done: some of the explanations are inside code blocks.

Apologies in advance; I'm afraid this will not be that easy to follow or to grade.

Here I'll load all the necessary libraries:

library(tidyverse)
library(ggplot2)
library(tidyr)
library(dplyr)
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

1. Analyses of Chapter 8 of the MABS book, but with the RATS data.

Import the data from the data wrangling step and check that everything is OK.

RATSL <- read_csv("data/rats.csv")
## Rows: 176 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): WD
## dbl (4): ID, Group, Weight, Time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ WD     <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", …
## $ Weight <dbl> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ Time   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …
# as evident from the glimpse, the categorical variables (ID, Group) need to be converted into factors

RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)

Okey dokey. The assignment was to reproduce the charts and tables of Chapter 8 of the MABS book with the RATS data. I will go through the charts and tables by their numbers in the book and redo them where possible. Analysis will be added as the final stage, if I have time.

NB! There are some charts/tables that I will not reproduce, since they are not relevant for the analysis. Such tables are, for instance, the example showing the first 40 rows of data, and Table 8.2 ("Possible summary measures"), which is not calculated from the data.

Just in case: my data RATSL is in long form, which might not be the most useful shape for everything needed here, so I will also reload the raw data in wide form.

RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", 
                   header = TRUE, sep = '\t')
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
str(RATS)
## 'data.frame':    16 obs. of  13 variables:
##  $ ID   : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WD1  : int  240 225 245 260 255 260 275 245 410 405 ...
##  $ WD8  : int  250 230 250 255 260 265 275 255 415 420 ...
##  $ WD15 : int  255 230 250 255 255 270 260 260 425 430 ...
##  $ WD22 : int  260 232 255 265 270 275 270 268 428 440 ...
##  $ WD29 : int  262 240 262 265 270 275 273 270 438 448 ...
##  $ WD36 : int  258 240 265 268 273 277 274 265 443 460 ...
##  $ WD43 : int  266 243 267 270 274 278 276 265 442 458 ...
##  $ WD44 : int  266 244 267 272 273 278 271 267 446 464 ...
##  $ WD50 : int  265 238 264 274 276 284 282 273 456 475 ...
##  $ WD57 : int  272 247 268 273 278 279 281 274 468 484 ...
##  $ WD64 : int  278 245 269 275 280 281 284 278 478 496 ...
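For reference, the wide-to-long conversion done in the wrangling script can be sketched roughly like this (a reconstruction on my part — the actual wrangling code lives in the data script; the pivot_longer call and the Time extraction via substring are my assumptions):

```r
library(dplyr)
library(tidyr)

# Sketch: recreate the long form from the wide RATS data.
# The "WD1".."WD64" columns become (WD, Weight) pairs; Time is the number after "WD".
RATSL_sketch <- RATS %>%
  pivot_longer(cols = starts_with("WD"),
               names_to = "WD", values_to = "Weight") %>%
  mutate(Time = as.integer(substring(WD, 3))) %>%
  arrange(Time, ID)
```

With 16 rats and 11 weighing days this gives 16 × 11 = 176 rows, matching the RATSL loaded above.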

TABLE 8.1 BPRS Measurements from 40 Subjects

-> this just gives an overview of the data; we are not going to repeat it.

FIGURE 8.1 Individual response profiles by treatment group for the BPRS data.

# linetype = ID is needed for this graph to work

ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) + 
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))

FIGURE 8.2

Individual response profiles for BPRS data after standardization.

# same as above, but with standardized values

RATSL <- RATSL %>%
  group_by(Time) %>%
  mutate(stdweight = (Weight - mean(Weight))/sd(Weight)) %>%
  ungroup()

ggplot(RATSL, aes(x = Time, y = stdweight, linetype = ID)) + 
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(name = "standardized weight")

FIGURE 8.3

Mean response profiles for the two treatment groups in the BPRS data.

RATSS <- RATSL %>%
  group_by(Group, Time) %>%
  summarise( mean = mean(Weight), se = sd(Weight) ) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# Glimpse the data
# glimpse(RATSS)

# Plot the mean profiles
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  geom_point(size=3) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.8,0.8)) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)")

FIGURE 8.4

Boxplots for the BPRS data.

glimpse(RATSL)
## Rows: 176
## Columns: 6
## $ ID        <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2,…
## $ Group     <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, …
## $ WD        <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1…
## $ Weight    <dbl> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, …
## $ Time      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, …
## $ stdweight <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8819001, …
ggplot(RATSL, aes(x = WD, fill = Group, y = Weight))+ 
  geom_boxplot()

TABLE 8.2

Possible Summary Measures -> this is a conceptual table, not computed from the data, so it is not reproduced.

FIGURE 8.5

Boxplots of mean summary measures for the two treatment groups in the BPRS data.

# Create summary data by group and subject, with the mean weight as the summary variable
# (note: unlike in the exercise this comment was copied from, the baseline day is not excluded here)
RATSL_1 <- RATSL %>%
  group_by(Group, ID) %>%
  summarise(mean = mean(Weight)) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# Glimpse the data
glimpse(RATSL_1)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273, 274.…
# Draw a boxplot of the mean versus treatment
ggplot(RATSL_1, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "blue") +
  scale_y_continuous(name = "mean(weight)")

FIGURE 8.6

Boxplots of mean summary measures for the two treatment groups in the BPRS data, without the outlier shown in Figure 8.5.

# Create new data by filtering out the outlier and redraw the plot with it
RATSL_2 <- RATSL %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(Weight) ) %>%
  filter(mean < 590) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
ggplot(RATSL_2, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "blue") +
  scale_y_continuous(name = "mean(Weight)")

I removed just one outlier; I'm not sure whether the intent was to remove all of them.

The three following tables are grouped together under the ANOVA analysis.

TABLE 8.3

Results From an Independent Samples t-test on the Mean Summary Measure for the BPRS Data, Without the Outlier Shown in Figure 8.5

TABLE 8.4

Results from an Analysis of Covariance of the BPRS Data with Baseline BPRS and Treatment Group as Covariates

TABLE 8.6

Results from an Independent Samples t-test for the Mean Summary Measure Used on the Data Partially Shown in Table 8.5. (a) Leaving Out Subjects with Any Missing Value, (b) Mean of Available Values for Each Subject

# Not sure what exactly needs to be done here, so I'll recreate what was in the Exercise

# t.test(mean ~ Group, data = RATSL_1, var.equal = TRUE) is NOT suitable, because Group has three levels, not two: ANOVA it is, then

# Add the baseline from the original data as a new variable to the summary data
RATSL_3 <- RATSL_1 %>%
  mutate(baseline = RATS$WD1) # The original raw RATS data was needed after all!

RATSL_3 <- RATSL_3 %>%
  filter(mean < 590)
  

# Fit the linear model with the mean as the response 
fit <- lm(mean ~ baseline + Group, data = RATSL_3)

# Compute the analysis of variance table for the fitted model with anova()
anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value    Pr(>F)    
## baseline   1 207029  207029 3272.6190 5.751e-15 ***
## Group      2   1226     613    9.6878  0.003748 ** 
## Residuals 11    696      63                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
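Table 8.3 in the book reports a two-sample t-test, which doesn't apply directly with three groups. If a t-test-style comparison were still wanted, one option (my addition, not part of the exercise) would be pairwise t-tests on the outlier-free summary data RATSL_2:

```r
# Pairwise two-sample t-tests between the three groups on the mean summary
# measure, with a Bonferroni correction for multiple comparisons
pairwise.t.test(RATSL_2$mean, RATSL_2$Group,
                p.adjust.method = "bonferroni")
```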

2. Analyses of Chapter 9 of the MABS book, but with the BPRS data.

Load the data from the data wrangling step.

BPRSL <- read_csv("data/bprs.csv")
## Rows: 360 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): weeks
## dbl (3): treatment, subject, bprs
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(BPRSL)
## Rows: 360
## Columns: 4
## $ treatment <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject   <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks     <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs      <dbl> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
# Same as above: convert the categorical variables into factors

BPRSL$treatment <- factor(BPRSL$treatment)
BPRSL$subject <- factor(BPRSL$subject)

Same thing here as above: loading the raw data, just in case.

BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", 
                   sep =" ", header = T)
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
str(BPRS)
## 'data.frame':    40 obs. of  11 variables:
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ week0    : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week1    : int  36 68 55 77 75 43 61 36 43 51 ...
##  $ week2    : int  36 61 41 49 72 41 47 38 39 51 ...
##  $ week3    : int  43 55 38 54 65 38 30 38 35 55 ...
##  $ week4    : int  41 43 43 56 50 36 27 31 28 53 ...
##  $ week5    : int  40 34 28 50 39 29 40 26 22 43 ...
##  $ week6    : int  38 28 29 47 32 33 30 26 20 43 ...
##  $ week7    : int  47 28 25 42 38 27 31 25 23 39 ...
##  $ week8    : int  51 28 24 46 32 25 31 24 21 32 ...

Same as above: I will do the charts and tables (listing them here with references to the book), but did not have time to write the analysis.

NOTE: all the chart names refer to rats even though I am using the BPRS data here, because the names come from the book.

TABLE 9.1

Body Weights of Rats Recorded Over a 9-Week Period

This is just viewing the data:

view(BPRS)

TABLE 9.2

Long Form of the Data for the First Two Rats in Group 1 in Table 9.1

#already done in wrangling, just glimpsing here

glimpse(BPRSL)
## Rows: 360
## Columns: 4
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks     <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs      <dbl> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …

FIGURE 9.1

Plot of weight against time for rat data, ignoring the repeated-measures structure of the data but identifying the group to which each observation belongs.

Rrright… with the BPRS data this is probably close enough:

ggplot(BPRSL, aes(x=weeks, y = bprs, group = treatment, colour = treatment)) + 
  geom_point()

TABLE 9.3

Results from Fitting a Linear Regression Model to Rat Data with Weight as Response Variable, and Group and Time as Explanatory Variables, and Ignoring the Repeated-Measures Structure of the Data

BPRSL_week <-  BPRSL %>% mutate(week = as.integer(substr(weeks,5,5)))


BPRSL_lm <- lm(bprs ~ week + treatment, data = BPRSL_week)
summary(BPRSL_lm)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL_week)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16
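The substr(weeks, 5, 5) trick above works because the week numbers are single digits (0 through 8). A slightly more robust alternative — assuming readr is available, which it is via the tidyverse — would be to parse the number out regardless of its width:

```r
# Extract the numeric part of "week0".."week8" without assuming a fixed width
BPRSL_week <- BPRSL %>%
  mutate(week = as.integer(readr::parse_number(weeks)))
```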

FIGURE 9.2

Plot of individual rat growth profiles.

ggplot(BPRSL_week, aes(x = week, y = bprs, linetype = subject, colour = subject)) +
  geom_line() + 
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(name = "bprs")

FIGURE 9.3

Scatterplot matrix of repeated measures in rat growth data.

# drop the character column "weeks" (column 3) before the scatterplot matrix
BPRSL_week_only <- BPRSL_week[, -3]
pairs(BPRSL_week_only)

TABLE 9.4

Results from Fitting Random Intercept Model, with Time and Group as Explanatory Variables, to Rat Growth Data

BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL_week_only, REML = FALSE)

# Print the summary of the model
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL_week_only
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

TABLE 9.5

Results from Fitting the Random Intercept and Slope Model, with Time and Group as Explanatory Variables, to Rat Growth Data

BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL_week_only, REML = FALSE)
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRSL_week_only
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL_week_only
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref     5 2748.7 2768.1 -1369.4   2738.7                       
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

TABLE 9.6

Results from Fitting the Random Intercept and Slope Model that Allows for a Group × Time Interaction to Rat Growth Data

BPRS_ref2 <-lmer(bprs ~ week * treatment + (week | subject), data = BPRSL_week_only, REML = FALSE)
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
##    Data: BPRSL_week_only
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2521  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840

FIGURE 9.4

Fitted growth rate profiles from the interaction model and observed growth rate profiles.

Fitted <- fitted(BPRS_ref2)

# Add the fitted values as a new column (this is the BPRS data, despite the RATSL wording in the exercise)
BPRSL_f <- BPRSL_week_only %>% mutate("Fitted" = Fitted)

# draw the plot of BPRSL with the fitted values of bprs
ggplot(BPRSL_f, aes(x = week, y = Fitted, linetype = subject)) +
  geom_line() +
  facet_grid(. ~ treatment, labeller = label_both) +
  #scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 60, 20)) +
  scale_y_continuous(name = "bprs") +
  theme(legend.position = "none")

# the observed and fitted profiles must be drawn as two separate plots, otherwise the result looks jagged
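One way to actually get the observed and fitted profiles side by side, as in the book's Figure 9.4, would be to draw the two plots separately and arrange them — a sketch, assuming the gridExtra package is installed (it is not loaded elsewhere in this document):

```r
library(gridExtra)  # assumption: needs to be installed separately

# Observed profiles, reusing BPRSL_week from above
p_obs <- ggplot(BPRSL_week, aes(x = week, y = bprs, linetype = subject)) +
  geom_line() +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") +
  scale_y_continuous(name = "bprs (observed)")

# Fitted profiles from the interaction model
p_fit <- ggplot(BPRSL_f, aes(x = week, y = Fitted, linetype = subject)) +
  geom_line() +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") +
  scale_y_continuous(name = "bprs (fitted)")

grid.arrange(p_obs, p_fit, nrow = 1)
```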

TABLE 9.7

First Five Patients in Each Treatment Group of the “Beat the Blues” (BtB) Clinical Trial of CBT for Depression

-> not relevant for the analysis, not reproduced

FIGURE 9.5

Box plots of BDI scores by occasion of recording and treatment group.

# this is essentially the same as Exercise 6, copied below; we have already done every meaningful chart


BPRSL8S <- BPRSL_week %>%
  filter(week > 0) %>%
  group_by(treatment, subject) %>%
  summarise( mean=mean(bprs) ) %>%
  ungroup()
## `summarise()` has grouped output by 'treatment'. You can override using the
## `.groups` argument.
# Draw a boxplot of the mean versus treatment
library(ggplot2)
ggplot(BPRSL8S, aes(x = treatment, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(bprs), weeks 1-8")

FIGURE 9.6

Scatterplot matrix of BDI scores.

Emm… already done.

TABLE 9.8

Results from Fitting a Multiple Linear Regression Model to BtB Data Assuming the Repeated Measurements of BDI are Independent

Same again: this would repeat the work above (the book switches datasets in the middle of the chapter).

The END

Thank you for reading this far, and apologies for the really awful format of this last assignment. I just didn't have time, due to life. Score accordingly. Peace out!